Introduction

To take advantage of simulations and astronomical data, astronomers need to know how far objects are from Earth. This project explores the distances of astronomical objects from Earth and how distance relates to quantities that are easier to measure. We will look into one measure of distance, redshift, and its relationship to an object's magnitude in various bandpasses. Redshift is found by comparing the wavelength of a photon when it leaves its source galaxy with its wavelength as measured on Earth. Our objects consist of both luminous red galaxies (LRGs), which lie outside the Milky Way, and stars, which lie inside it. Despite their differences in location, some stars and LRGs have similar colors and magnitudes. Thus, we hope to classify the objects as either stars or LRGs and predict redshift using our various predictor variables.

In this project I will attempt to uncover the relationship between an object's class (star or LRG), its redshift, and its colors and magnitude in different bandpasses.

Data Description

This data was taken from the Extended Baryon Oscillation Spectroscopic Survey (eBOSS), part of the fourth phase of the Sloan Digital Sky Survey (SDSS-IV). The data are measurements from 45,664 physical objects, each either a star in the Milky Way or a luminous red galaxy (LRG). There are seven predictors, the object's magnitudes measured in different bandpasses; we dropped one predictor due to issues with the data. There is one response variable, redshift, which refers to the shift of an object's light toward longer wavelengths because the object is very far away. Redshift (\(z\)) is defined as:

\[1 + z = \frac{\lambda_{\rm obs}}{\lambda_{\rm emit}}\]
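
As a minimal sketch (the numbers are illustrative, not from the dataset), the definition above translates directly into code, here using the hydrogen H-alpha line (rest wavelength 656.3 nm) as the emitted wavelength:

```python
def redshift(lam_obs, lam_emit):
    """Redshift z from observed and emitted (rest-frame) wavelengths,
    via 1 + z = lam_obs / lam_emit."""
    return lam_obs / lam_emit - 1.0

# Illustrative values: H-alpha emitted at 656.3 nm, observed at 1148.5 nm
z = redshift(1148.5, 656.3)
print(round(z, 3))  # → 0.75
```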

Even though we only have one measured response value, we will also use an object's class as a response variable. Class is derived by labeling objects with redshift less than or equal to 0.01 (essentially zero) as stars; objects with redshift greater than 0.01 are labeled LRGs, because at that point we are confident the object is not in the Milky Way.
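
This labeling rule can be sketched as follows (a hypothetical helper, not the project's actual code; the redshift values are made up):

```python
def classify(z, threshold=0.01):
    """Label an object by redshift: z <= threshold → star (inside the
    Milky Way), otherwise a luminous red galaxy (LRG)."""
    return "STAR" if z <= threshold else "LRG"

redshifts = [0.0003, 0.009, 0.42, 0.81]   # hypothetical values
labels = [classify(z) for z in redshifts]
print(labels)  # → ['STAR', 'STAR', 'LRG', 'LRG']
```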

Below we list the predictors used in our final model. We focused mostly on colors, which are found by taking the difference of magnitudes in two bandpasses.

Variable Description
mag.i The magnitude of the object in the SDSS bandpass at 850nm, “infrared”
color.ug The difference in magnitude between the ultraviolet and green bandpasses
color.gr The difference in magnitude between the green and red bandpasses
color.ri The difference in magnitude between the red and infrared bandpasses
color.iz The difference in magnitude between the two infrared bandpasses
color.zW1 The difference in magnitude between an infrared SDSS bandpass and a WISE mid-infrared bandpass
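
For instance (with hypothetical magnitudes; the variable names follow the table above), each color is simply a difference of two magnitudes:

```python
# Hypothetical SDSS/WISE magnitudes for a single object (illustrative only)
mags = {"u": 22.1, "g": 20.3, "r": 19.0, "i": 18.4, "z": 18.1, "W1": 17.2}

# Each color is the difference of magnitudes in two bandpasses
colors = {
    "color.ug": mags["u"] - mags["g"],
    "color.gr": mags["g"] - mags["r"],
    "color.ri": mags["r"] - mags["i"],
    "color.iz": mags["i"] - mags["z"],
    "color.zW1": mags["z"] - mags["W1"],
}
print(round(colors["color.gr"], 2))  # → 1.3
```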

EDA

Redshift

The histogram of redshift shows bimodality. The majority of the redshift data is symmetrically centered around 0.75, but a subset of the data is centered near 0. This second cluster is the population of stars that we will attempt to classify and filter out before predicting redshift.

Predictors

Here we see the general shapes of the predictors. On the left is color.ri; this distribution is strongly right-skewed, as are color.iz and mag.i. On the right is color.zW1; this distribution is relatively symmetric and approximately normal, as are color.ug and color.gr.

This plot shows the relationship between color.iz and color.ri, which is representative of the relationships among essentially all of the other predictors: similar point-cloud structures with a very slight upward or downward trend. These graphs also show the extent to which stars (in green) and LRGs (in blue) overlap in this predictor space.

Predictors Compared to Redshift

These graphs bring us closer to a relationship between the colors and redshift, though here we are only looking at LRGs. The left panel shows a somewhat negative relationship between color.gr and redshift when we restrict attention to the densest region of points; a similar relationship holds for color.ug, color.ri, and color.iz. The right panel shows a point cloud with a slight positive relationship between color.zW1 and redshift. Since the form of these relationships is unclear, we will explore both linear and non-linear fits in our models. We also see that the density of stars (in green) is similar to that of LRGs (in blue) for color.gr, whereas for color.zW1 the two densities differ slightly, with stars having somewhat lower values.

Here we see a fairly strong linear correlation between mag.i and each of the colors. There also appears to be a weaker linear correlation between color.iz and color.zW1, while the remaining pairs of predictors show more complex relationships.

This plot also shows distinct grouping effects even in two-dimensional predictor spaces. There is fairly strong separation between stars and LRGs when comparing color.zW1 against each of the other colors, as well as color.iz against both color.ug and color.ri. Based on these results, we can hope to classify objects accurately as stars or LRGs in the full multi-dimensional predictor space.

This graph shows that the stars, represented by the green lines, converge more tightly at color.zW1 and color.iz than at the other predictors. This gives us insight into which predictors might prove important for classifying objects in this dataset.

PCA

We used PCA to explore reducing the dimensionality of the data for our regression step. However, the screeplot suggested that principal components 1-5 are needed to capture approximately 90% of the variance in the data. This is not a significant reduction in dimensionality, so we proceeded with all the data and attempted to choose important predictors through our prediction models.
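
The screeplot computation can be sketched as follows, assuming standardized predictors; the data here are synthetic stand-ins, since the real eBOSS table is not loaded in this snippet:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the six predictors (not the real eBOSS data)
X = rng.normal(size=(1000, 6))
X[:, 1] += 0.8 * X[:, 0]          # induce some correlation between "colors"
X[:, 2] += 0.5 * X[:, 0]

# Standardize, then eigendecompose the covariance of the standardized data
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Xs, rowvar=False))[::-1]  # descending
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)
print(np.round(cumulative, 3))  # cumulative proportion of variance per component
```

The number of components whose cumulative proportion first exceeds 0.90 is the count read off the screeplot.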

Model

Our final model is a two-step model: first we predict whether or not the object is a star, then we predict redshift. However, to build some intuition, we began by exploring the classification and regression problems separately.
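
The two-step structure can be sketched with placeholder models (`toy_classify` and `toy_regress` below are illustrative stand-ins, not the fitted boosting and random forest models):

```python
def predict_two_step(x, classify, regress):
    """Two-step prediction: first decide star vs. LRG, then predict
    redshift only for objects classified as LRGs (stars get z = 0)."""
    label = classify(x)
    z = 0.0 if label == "STAR" else regress(x)
    return label, z

# Stand-in models for illustration (the report uses boosting + random forest)
toy_classify = lambda x: "STAR" if x["color.zW1"] < 0.5 else "LRG"
toy_regress = lambda x: 0.6 + 0.1 * x["color.gr"]

print(predict_two_step({"color.zW1": 1.2, "color.gr": 1.5}, toy_classify, toy_regress))
```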

Classification

Our first step was to handle the classification problem. We ran multiple classification models and focused on those with the lowest misclassification rate (MCR). As shown in the table below, random forest and gradient boosting gave the lowest MCRs, so we focused on them. Since our data has very unbalanced classes, we thought it might be important to balance the classes so that our model would classify stars more accurately. However, we found this was not helpful: running random forest with balanced classes (with mag.i included) gave a larger MCR, 0.233, compared to 0.084 without balancing. So we accept that our model is inherently better at classifying LRGs, given the composition of our data.

Model MCR
Logistic Regression 0.0967
LDA 0.0969
Random Forest 0.0864
Gradient Boosting 0.0879
K-nearest neighbors 0.0891
Subset Logistic Regression 0.0968
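
The balancing experiment can be sketched as follows, assuming scikit-learn's `RandomForestClassifier` and its `class_weight="balanced"` option, run here on imbalanced synthetic data rather than the real eBOSS measurements:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
# Imbalanced synthetic stand-in: ~96% LRGs, ~4% stars, as in the real data
n_lrg, n_star = 960, 40
X = np.vstack([rng.normal(0.0, 1.0, size=(n_lrg, 3)),
               rng.normal(1.5, 1.0, size=(n_star, 3))])
y = np.array(["LRG"] * n_lrg + ["STAR"] * n_star)

# class_weight="balanced" reweights classes inversely to their frequency
rf_plain = RandomForestClassifier(random_state=0).fit(X, y)
rf_balanced = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X, y)

mcr_plain = float(np.mean(rf_plain.predict(X) != y))
mcr_balanced = float(np.mean(rf_balanced.predict(X) != y))
print("plain MCR:", round(mcr_plain, 3))
print("balanced MCR:", round(mcr_balanced, 3))
```

On the real data, a proper comparison would use held-out MCR rather than training MCR as computed here.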

Regression

We also needed to identify models that give low MSEs on the regression problem alone. Below are the MSEs for various models. From this table we decided to focus on GAMs and random forest, which give the lowest MSEs.

Model MSE
Linear Regression 0.0256
Subset Regression 0.0256
Ridge Regression 0.0256
Lasso Regression 0.0256
GAMs 0.0244
GAMs with mag.i linear 0.0244
Random Forest 0.0245

We thought it was important to investigate whether a partially linear GAM would be beneficial, since the plot of our GAM fit suggested that the effect of mag.i is roughly linear. We therefore ran a GAM ANOVA with mag.i entered as a linear term and found the fit was significantly different from the model with all predictors non-linear. So we also tried a subset GAM with mag.i linear.
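
As a sanity check, the F statistic in the analysis-of-deviance table can be recomputed by hand from the reported deviances and degrees of freedom:

```python
# Recompute the F statistic reported in the analysis-of-deviance table:
# F = (deviance change / df change) / (full-model deviance / its residual df)
resid_df_full, resid_dev_full = 20110, 500.4   # all predictors smoothed
delta_df, delta_dev = 3, 0.7692                # from making mag.i linear

F = (delta_dev / delta_df) / (resid_dev_full / resid_df_full)
print(round(F, 1))  # → 10.3
```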

Analysis of Deviance Table

Resid. Df Resid. Dev Df Deviance F Pr(>F)
20110 500.4 NA NA NA NA
20113 501.1 -3 -0.7692 10.3 8.965e-07

Two Step Model Selection

After running the relevant combinations of classifier and regressor, we needed a metric that accounted for both the MCR from the classification step and the MSE from the regression step. Since the MCR and MSE were similar in magnitude in our two-step model, we simply used their sum and chose the model with the lowest value. We landed on gradient boosting (with mag.i) to classify objects as stars or LRGs, followed by random forest to predict redshift for the predicted LRGs.

Model MCR MSE after classification Metric (Sum)
Random Forest and Random Forest 0.0831 0.0709 0.1540
Random Forest and GAMs 0.0831 0.0709 0.1540
Random Forest and Subset GAMs 0.0831 0.0708 0.1539
Boosting and GAMs 0.0831 0.0663 0.1494
Boosting and Subset GAMs 0.0831 0.0670 0.1501
Boosting and Random Forest 0.0831 0.0645 0.1476
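
The model-selection rule reduces to taking the smallest MCR + MSE sum over the combinations in the table above:

```python
# (MCR, post-classification MSE) for each classifier + regressor pairing,
# taken from the model-selection table
results = {
    "Random Forest + Random Forest": (0.0831, 0.0709),
    "Random Forest + GAMs":          (0.0831, 0.0709),
    "Random Forest + Subset GAMs":   (0.0831, 0.0708),
    "Boosting + GAMs":               (0.0831, 0.0663),
    "Boosting + Subset GAMs":        (0.0831, 0.0670),
    "Boosting + Random Forest":      (0.0831, 0.0645),
}

# Choose the pairing with the smallest sum of the two error measures
best = min(results, key=lambda k: sum(results[k]))
print(best, round(sum(results[best]), 4))  # → Boosting + Random Forest 0.1476
```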

Final Model

From the table above we see that the best two-step model uses gradient boosting as the classifier and a random forest regressor with mag.i, as it has the smallest sum.

  LRG STAR
LRG 12192 987
STAR 152 369
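
The headline misclassification rate can be recovered from the confusion matrix; whichever axis holds the true classes, the off-diagonal cells are the misclassifications:

```python
# Confusion matrix counts from the final model, keyed as (true, predicted);
# the MCR uses only the off-diagonal cells, so it is the same either way
confusion = {("LRG", "LRG"): 12192, ("LRG", "STAR"): 987,
             ("STAR", "LRG"): 152,  ("STAR", "STAR"): 369}

total = sum(confusion.values())
wrong = sum(n for (true, pred), n in confusion.items() if true != pred)
mcr = wrong / total
print(round(mcr, 4))  # → 0.0831
```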

We see from the confusion matrix that our model, as expected, is much better at classifying LRGs than stars. The ROC plot also shows that our model balances specificity and sensitivity: our false positive rate is low while our true positive rate is high.

From the variable importance plot we see that color.zW1 is the most important variable, with color.iz the next most important. We also see there is not a large difference between the 3 least important variables.

We see this model is somewhat helpful in predicting redshift. There is a clear improvement over simply using the mean of the training redshifts (indicated by the red line). However, because of the large mass of points in the 0.5 to 1 range, the trend is difficult to see, so on the left we include a zoomed-in version to show that our model carries real information. The mass of points across the bottom of the plot corresponds to LRGs misclassified as stars, for which we predict a redshift of 0.

Conclusion

We managed to learn a model that effectively classifies objects as either stars or LRGs and then predicts redshift. Though there is some error in the classifier and some error in the regression model, which compound when the two are combined, we still obtain meaningful results: we classify objects with a misclassification rate of only 8.31% and predict redshift with an MSE of 0.06445.